[LLM:Feature] Add Segment Mode, Speed up metal llm for 30%-100% #4543

Open
jxt1234 wants to merge 2 commits into alibaba:master from jxt1234:feature/llm_mini

Conversation

jxt1234 commented Jun 15, 2026

Collaborator

Description

Module

Type

  • Feature
  • Bugfix
  • Perf
  • Refact
  • Style
  • Doc
  • Test
  • Chore

Checklist

  • Commit message follows [Module:Type] Description format
  • Code compiles without errors
  • Tested on relevant platform(s)
  • No unrelated format or style changes included

jxt1234 force-pushed the feature/llm_mini branch 2 times, most recently from 4418abe to bebbc9a on Jun 15, 2026 11:02
wangzhaode self-assigned this on Jun 15, 2026
jxt1234 force-pushed the feature/llm_mini branch from bebbc9a to a58b9d6 on Jun 16, 2026 02:43
jxt1234 changed the title from "[LLM:Feature] Support Segment Mode, currently only support metal backend" to "[LLM:Feature] Add Segment Mode, Speed up metal llm for 30%-100%" on Jun 16, 2026
jxt1234 force-pushed the feature/llm_mini branch from a58b9d6 to 13e553a on Jun 18, 2026 02:42
wangzhaode commented Jun 18, 2026

Collaborator

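The optimization direction of this PR is valuable, especially for small-model GPU decode: RoPE fusion, fewer NC4HW4 round-trip conversions, and splitting out TopK/embedding/logit are all key points.

That said, I suggest splitting the changes into separate submissions/merges to lower the cost of review and of locating regressions:

  1. First, split out the general-purpose foundational capabilities:

    • OpType_RoPE and its CPU/Metal/OpenCL backend implementations
    • Attention output_c4 / attnScale
    • NC4HW4 LayerNorm / binary LayerNorm
    • MUL_SILU
    • TopKV optimization
    • SharedGather / prearrange clone related capabilities
    • Adjustments to the layout propagation rules in the converter

    These capabilities are general-purpose and can later be reused by the existing torch -> ONNX -> MNN export path, so I suggest giving them their own regression tests and performance data.

  2. Then submit the SegmentLlm / safetensor workflow path separately:

    • segment.py
    • safetensors converter / workflow json
    • Segmented export of decoder.mnn, embed.mnn, logit.mnn, and topk.mnn
    • SegmentLlm runtime loading and inference logic

    This part is more of a new LLM fast path and can be reviewed independently as an opt-in path, with a focus on verifying model coverage, sampling behavior, configuration compatibility, and consistency with the existing llm.mnn path.

With this split, the foundational optimizations can land in the main path first, and it also becomes easier to tell whether a problem comes from a backend primitive, a layout pass, the converter, or the SegmentLlm runtime. I support the overall direction, but suggest not bundling the general-purpose foundation and the new segment runtime path into one large PR that is merged all at once.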

jxt1234 force-pushed the feature/llm_mini branch from 13e553a to 25fce5f on Jun 18, 2026 08:31
jxt1234 force-pushed the feature/llm_mini branch from 25fce5f to 9c8fdb6 on Jun 18, 2026 09:34
2 participants